Abstract

We consider the problem of estimating a low rank covariance function $K(t,u)$ of a Gaussian process $S(t)$, $t \in [0,1]$, based on $n$ i.i.d. copies of $S$ observed in white noise. We suggest a new estimation procedure adapting simultaneously to the low rank structure and the smoothness of the covariance function. The new procedure is based on nuclear norm penalization and exhibits superior performance as compared to the sample covariance function by a polynomial factor in the sample size $n$. Other results include a minimax lower bound for estimation of low-rank covariance functions, showing that our procedure is optimal, as well as a scheme to estimate the unknown noise variance of the Gaussian process.

Keywords: Gaussian process, Low rank covariance function, Nuclear norm, Empirical risk minimization, Minimax lower bounds, Adaptation

1. Introduction 

Let $X(t)$, $t \in [0,1]$, be a Gaussian process satisfying the following stochastic differential equation:

$$dX(t) = S(t)\,dt + \sigma\,dW(t), \quad t \in [0,1], \qquad (1)$$

where $W$ is the standard Brownian motion, $\sigma > 0$ is the noise level, and

$$S(t) = \sum_{k=1}^{r} \sqrt{\lambda_k}\,\xi_k\,\varphi_k(t), \quad t \in [0,1].$$

Here $\xi_k$ are i.i.d. standard Gaussian random variables independent of the Brownian motion $W$, $\{\varphi_k\}_{k=1}^{r}$ are unknown orthonormal functions in $L_2[0,1]$, possibly with $r = \infty$, and the coefficients $\lambda_k > 0$ are unknown and such that $\sum_{k=1}^{r} \lambda_k < \infty$. The value of $r$ is also unknown.

Assume that we observe $n$ i.i.d. copies $X_1(t), \dots, X_n(t)$ of the process $X(t)$. In this paper, we study the problem of estimation of the covariance function of the stochastic process $S(\cdot)$,

$$K(t,u) = \mathbb{E}(S(t)S(u)) = \sum_{k=1}^{r} \lambda_k \varphi_k(t)\varphi_k(u), \quad t,u \in [0,1], \qquad (2)$$

based on the observations $\{X_1(t), \dots, X_n(t),\ t \in [0,1]\}$. If $r = \infty$, the sum in (2) is understood in the sense of $L_2([0,1] \times [0,1])$-convergence. In short, (1) is a model of a “signal” (Gaussian stochastic process $S$) observed in Gaussian white noise, and the goal is to estimate the covariance of the signal based on a sample of such observations.
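To make the observation model concrete, the following Python sketch simulates the coefficients of $n$ noisy trajectories relative to a fixed orthonormal basis of $L_2[0,1]$, that is, the projections of the signal plus independent Gaussian noise of level $\sigma$. The function name `simulate_coefficients`, the use of NumPy, and the representation of the eigenfunctions by random orthonormal coefficient vectors in $\mathbb{R}^l$ are assumptions made purely for illustration; they are not taken from the paper.

```python
import numpy as np

def simulate_coefficients(n, l, lambdas, sigma, seed=0):
    """Simulate basis coefficients of n noisy trajectories.

    Hypothetical helper: the eigenfunctions phi_k are represented by
    r random orthonormal coefficient vectors in R^l (an assumption made
    only for illustration; it requires l >= r).
    Returns (x, K), where x has shape (n, l) and K is the l x l matrix
    of the covariance function in the chosen basis.
    """
    rng = np.random.default_rng(seed)
    lambdas = np.asarray(lambdas, dtype=float)
    r = lambdas.size
    # Orthonormal columns: coefficients of phi_1, ..., phi_r in the basis.
    Phi, _ = np.linalg.qr(rng.standard_normal((l, r)))
    xi = rng.standard_normal((n, r))             # i.i.d. N(0, 1) scores
    signal = (xi * np.sqrt(lambdas)) @ Phi.T     # coefficients of S_i
    noise = sigma * rng.standard_normal((n, l))  # white-noise coefficients
    K = (Phi * lambdas) @ Phi.T                  # sum_k lambda_k phi_k phi_k^T
    return signal + noise, K

x, K = simulate_coefficients(n=200, l=10, lambdas=[3.0, 1.0], sigma=0.5)
print(x.shape, np.linalg.matrix_rank(K))
```

Working in coefficient space avoids discretizing the stochastic integral: each row of `x` plays the role of the vector of projected observations for one trajectory.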

Statistical estimation of covariance functions has already received some attention in the literature. However, a somewhat different setting was considered, in which the trajectories $X_i(\cdot)$ are observed at discrete time locations:


f i,j ^ 2, 




where the noise variables are i.i.d. $\mathcal{N}(0,1)$ and, for each $i$, the points $T_{ij}$, $1 \le j \le m$, are equispaced in the interval $[0,1]$ or are independent random variables with uniform distribution on $[0,1]$. In this setting, Yao et al. (2005) proposed a local smoothing estimation procedure assuming that the trajectories $X_i(\cdot)$ are well approximated by the projection on the linear span of functions $\varphi_1, \dots, \varphi_k$ for some known fixed $k$ chosen by cross-validation. This procedure is computationally intensive, as it requires computing the eigenvalues and the inverse of $n$ distinct $m \times m$ empirical covariance matrices of the trajectories $X_i$, $1 \le i \le n$, at each of the cross-validation steps. The results in Yao et al. (2005) provide theoretical guarantees for estimation of the covariance function and its eigenfunctions under the condition that the previous approximation is sufficiently precise. Hall et al. (2006) consider the same methodology and study the effect of the sampling rate on the estimation rate of the eigenfunctions. In a similar framework, Bunea and Xiao (2013) propose a simpler procedure to estimate the eigenfunctions and obtain theoretical guarantees on the estimation error. Their approach involves a dimension reduction step where the selection of the relevant eigenfunctions is performed by thresholding the eigenvalues of a correctly constructed empirical covariance matrix. In a similar setting, Bigot et al. (2010) consider the estimation of the covariance matrix of the process $S$ at sample points rather than that of the covariance function. This problem can be reduced to multivariate regression, and Bigot et al. (2010) develop a model selection approach to it, resulting in some oracle inequalities.

Notably, strong regularity conditions are usually imposed on the eigenfunctions $\varphi_k$ in the existing literature. In Hall et al. (2006), the eigenfunctions are assumed to admit bounded derivatives of order at least two. In addition, the optimal bandwidth choice in the local smoothing approach used in Hall et al. (2006); Yao et al. (2005) requires knowledge of the degree of smoothness of the eigenfunctions. In Bunea and Xiao (2013), the eigenfunctions are assumed to be continuously differentiable with bounded derivatives, the sequence of eigenvalues belongs to a Sobolev ball with regularity $\beta > 0$, and the optimal choice of the threshold in the dimension reduction step depends on $\beta$.

An interesting question is what the optimal rates of estimation of the covariance function are in a minimax sense. To our knowledge, this question has not been addressed in the literature.

In this paper, we assume that the trajectories $X_i(\cdot)$ are fully observed in time. Our aim is to understand the influence of the structure of the covariance function $K$ on the estimation rate. The main contributions of this paper are as follows:

1. We propose a simple data-driven procedure to estimate the covariance function and prove oracle inequalities for it based on recent results on high-dimensional matrix estimation.

2. We show that the proposed method is minimax optimal for estimation of $K$ in the $L_2$-norm whereas the empirical covariance estimator is suboptimal.


2. Definitions and notations


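Let $e_1(\cdot), e_2(\cdot), \dots$ be an orthonormal basis of $L_2[0,1]$, which is assumed to be fixed throughout the paper. Denote by $\|\cdot\|_2$ the norms either of $L_2[0,1]$ or of $L_2([0,1] \times [0,1])$ (according to the context) and by $\langle \cdot, \cdot \rangle$ the corresponding inner products. For any integer $l \ge 1$, consider the orthogonal projection $S^{(l)} = \sum_{k=1}^{l} \langle e_k, S \rangle e_k$ of $S$ onto the linear span of $\{e_1, \dots, e_l\}$. Set

$$X^{(l)} = \sum_{k=1}^{l} \Big( \int_0^1 e_k(t)\,dX(t) \Big) e_k, \qquad W^{(l)} = \sum_{k=1}^{l} \Big( \int_0^1 e_k(t)\,dW(t) \Big) e_k. \qquad (3)$$

In view of (1), we have

$$X^{(l)} = S^{(l)} + \sigma W^{(l)}.$$

Similarly to (3), we define the processes

$$X_i^{(l)} = \sum_{k=1}^{l} \Big( \int_0^1 e_k(t)\,dX_i(t) \Big) e_k, \quad i = 1, \dots, n,$$

and consider the empirical covariance function

$$R_n^{(l)}(t,u) = \frac{1}{n} \sum_{i=1}^{n} X_i^{(l)}(t) X_i^{(l)}(u), \quad t,u \in [0,1].$$

Note that the expectation of $R_n^{(l)}(t,u)$ is

$$\mathbb{E}[R_n^{(l)}(t,u)] = K^{(l)}(t,u) + \sigma^2 I^{(l)}(t,u),$$

with $I^{(l)}(t,u) = \sum_{k=1}^{l} e_k(t) e_k(u)$ and

$$K^{(l)}(t,u) = \sum_{m=1}^{r} \lambda_m \varphi_m^{(l)}(t) \varphi_m^{(l)}(u),$$

where $\varphi_m^{(l)} = \sum_{k=1}^{l} \langle e_k, \varphi_m \rangle e_k$ is the orthogonal projection of $\varphi_m$ onto the linear span of $\{e_1, \dots, e_l\}$. In what follows, we will consider the set of functions

$$\mathcal{S}_l = \Big\{ \sum_{j,k=1}^{l} a_{jk}\, e_j \otimes e_k \ :\ a_{jk} = a_{kj} \in \mathbb{R},\ j,k = 1, \dots, l \Big\},$$

where $(e_j \otimes e_k)(t,s) = e_j(t) e_k(s)$. The set $\mathcal{S}_l$ consists of all symmetric kernels belonging to the linear span of $\{e_j \otimes e_k : j,k = 1, \dots, l\}$. Note that $K$ is not necessarily in $\mathcal{S}_l$ while $R_n^{(l)} \in \mathcal{S}_l$. It is easy to see that $K^{(l)}$ is the orthogonal projection of $K$ onto $\mathcal{S}_l$.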

If no ambiguity is caused, for any $A \in \mathcal{S}_l$, we will use the same symbol $A$ to denote the corresponding symmetric $l \times l$ matrix. For any function $A \in \mathcal{S}_l$ or any $l \times l$ matrix $A$, we denote by $\|A\|_1$ and $\|A\|_\infty$ its nuclear and spectral norms, respectively. The trace and the rank of the matrix $A$ are denoted by $\mathrm{tr}(A)$ and $\mathrm{rank}(A)$, and its Frobenius norm by $\|A\|_F$. Writing $A \ge 0$ for a matrix $A$ means that $A$ is non-negative definite.
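In matrix form, the empirical covariance function is the average of the outer products of the coefficient vectors, and its expectation is the matrix of $K^{(l)}$ plus $\sigma^2$ times the identity. The sketch below illustrates this on a toy rank-one example; the helper names and the numerical values ($K = 2\,e_1 \otimes e_1$, $\sigma^2 = 1$, $n = 20000$) are assumptions chosen for illustration only.

```python
import numpy as np

def empirical_covariance(x):
    """Matrix of R_n^{(l)}: average of outer products x_i x_i^T (no centering)."""
    x = np.asarray(x, dtype=float)
    return x.T @ x / x.shape[0]

def spectral_norm(A):
    """Largest singular value, i.e. the norm written ||A||_inf in the text."""
    return np.linalg.norm(A, 2)

# Toy check with a rank-one K = 2 * e_1 e_1^T and sigma^2 = 1 (assumed values).
rng = np.random.default_rng(0)
n, l, sigma = 20000, 5, 1.0
K = np.zeros((l, l))
K[0, 0] = 2.0
signal = np.sqrt(2.0) * rng.standard_normal((n, 1)) * np.eye(l)[0]
x = signal + sigma * rng.standard_normal((n, l))
R = empirical_covariance(x)
err = spectral_norm(R - K - sigma**2 * np.eye(l))
print(f"spectral error: {err:.3f}")
```

The printed quantity is the spectral-norm deviation of the empirical covariance from its expectation, which is the single random variable that drives the analysis of the penalized estimator.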


3. Nuclear norm penalized estimator and its convergence rate 

In this section, we assume that the noise level $\sigma$ is known. For an integer $l \ge 1$, we define the estimator $\hat K^{(l)}$ of $K$ as a solution of the following penalized minimization problem:

$$\hat K^{(l)} \in \operatorname{argmin}_{A \in \mathcal{S}_l,\ A \ge 0} \Big\{ \|R_n^{(l)} - \sigma^2 I^{(l)} - A\|_2^2 + \mu \|A\|_1 \Big\}, \qquad (4)$$

where $\mu > 0$ is a regularization parameter to be tuned. Note that here we have $\|A\|_1 = \mathrm{tr}(A)$. The solution of (4) is explicitly expressed via soft thresholding of the eigenvalues of the matrix $R_n^{(l)} - \sigma^2 I^{(l)}$ (cf. Koltchinskii et al. (2011)). The next theorem easily follows from the argument in the proof of Theorem 1 in Koltchinskii et al. (2011) (see also Lounici (2014)).
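The soft-thresholding solution can be sketched in a few lines. Assuming the criterion is the squared Frobenius distance to the noise-corrected empirical covariance matrix plus $\mu$ times the trace, minimizing $(m - a)^2 + \mu a$ over $a \ge 0$ for each eigenvalue $m$ gives the shrinkage $a = (m - \mu/2)_+$. The function name below is hypothetical, and the code is an illustration of this computation rather than the authors' implementation.

```python
import numpy as np

def nuclear_norm_estimator(R, sigma2, mu):
    """Sketch of a solution of the penalized problem in matrix form.

    Minimizes ||R - sigma2 * I - A||_F^2 + mu * tr(A) over symmetric
    A >= 0 by shrinking each eigenvalue of M = R - sigma2 * I by mu / 2
    and truncating at zero.
    """
    R = np.asarray(R, dtype=float)
    M = R - sigma2 * np.eye(R.shape[0])
    M = (M + M.T) / 2.0                      # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)
    shrunk = np.maximum(eigvals - mu / 2.0, 0.0)
    return (eigvecs * shrunk) @ eigvecs.T

R = np.diag([4.0, 2.0, 0.0])
K_hat = nuclear_norm_estimator(R, sigma2=1.0, mu=2.0)
print(np.round(K_hat, 6))
```

In this toy example the corrected matrix has eigenvalues 3, 1 and -1, so with $\mu = 2$ only the largest eigenvalue survives the shrinkage, which shows how the penalty produces a low rank estimate.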


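Theorem 1. Let $n, l \ge 1$ be integers and let $X_1(\cdot), \dots, X_n(\cdot)$ be i.i.d. realizations of the process $X(\cdot)$ satisfying (1). If $\mu \ge 2\|R_n^{(l)} - K^{(l)} - \sigma^2 I^{(l)}\|_\infty$, then, for any $K$ satisfying (2) with $\sum_{k=1}^{r} \lambda_k < \infty$, we have

$$\|\hat K^{(l)} - K\|_2^2 \le \inf_{A \in \mathcal{S}_l,\ A \ge 0} \Big\{ \|A - K\|_2^2 + \min\Big\{ 2\mu \|A\|_1,\ \Big(\frac{1+\sqrt{2}}{2}\Big)^2 \mu^2\, \mathrm{rank}(A) \Big\} \Big\}.$$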

This theorem is a deterministic fact as soon as we have a proper bound on a single random variable, namely, the spectral norm $\|R_n^{(l)} - K^{(l)} - \sigma^2 I^{(l)}\|_\infty$. In other words, all stochastic effects in our problem are localized in the behaviour of this random variable, and the choice of $\mu$ is driven by it as well. The next lemma provides a probabilistic bound on this random variable.

Lemma 2. Let $n, l \ge 1$ be integers and let $X_1(\cdot), \dots, X_n(\cdot)$ be i.i.d. realizations of the process $X(\cdot)$ satisfying (1). Set $\lambda_{\max} = \max_{k} \lambda_k$. For any $t > 0$ and $l \ge 1$, define

$$\delta_n(l,t) = \max\Big\{ \sqrt{\frac{l+t}{n}},\ \frac{l+t}{n} \Big\}. \qquad (5)$$

Then, with probability at least $1 - e^{-t}$, for any $K$ satisfying (2) with $\sum_{k=1}^{r} \lambda_k < \infty$ we have

$$\|R_n^{(l)} - K^{(l)} - \sigma^2 I^{(l)}\|_\infty \le C(\lambda_{\max} + \sigma^2)\,\delta_n(l,t),$$

for some absolute constant $C > 0$.

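Proof. Set $x_i(l) = \big( \int_0^1 e_1(t)\,dX_i(t), \dots, \int_0^1 e_l(t)\,dX_i(t) \big)^\top$ for any $1 \le i \le n$ and $B_{n,l} = \frac{1}{n} \sum_{i=1}^{n} x_i(l) x_i(l)^\top$. Note that $x_i(l)$ are i.i.d. normal random vectors with mean 0 and covariance matrix $B_l = K^{(l)} + \sigma^2 I_l$. Also, $\|R_n^{(l)} - K^{(l)} - \sigma^2 I^{(l)}\|_\infty = \|B_{n,l} - B_l\|_\infty$. Here, $I_l$ is the $l \times l$ identity matrix. Next,

$$\|B_{n,l} - B_l\|_\infty \le \|B_l\|_\infty \Big\| \frac{1}{n} \sum_{i=1}^{n} Z_i Z_i^\top - I_l \Big\|_\infty \le (\lambda_{\max} + \sigma^2) \Big\| \frac{1}{n} \sum_{i=1}^{n} Z_i Z_i^\top - I_l \Big\|_\infty,$$

where $Z_1, \dots, Z_n$ are i.i.d. standard normal vectors in $\mathbb{R}^l$. Here we also used the fact that the following representation holds for the random vectors $x_i(l)$: $x_i(l) = B_l^{1/2} Z_i$. Applying Theorem 5.39 in Vershynin (2012) to the random variable $\big\| \frac{1}{n} \sum_{i=1}^{n} Z_i Z_i^\top - I_l \big\|_\infty$, we get the result. □

Theorem 1 together with Lemma 2 immediately implies the following result.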


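Theorem 3. Let $n, l \ge 1$ be integers and let $X_1(\cdot), \dots, X_n(\cdot)$ be i.i.d. realizations of the process $X(\cdot)$ satisfying (1). Take

$$\mu = c(\lambda_{\max} + \sigma^2)\,\delta_n(l,t),$$

for some sufficiently large absolute constant $c > 0$. Define

$$v_n(A,l,t) = \min\big\{ (\lambda_{\max} + \sigma^2)\,\mathrm{tr}(A)\,\delta_n(l,t),\ (\lambda_{\max} + \sigma^2)^2\,\mathrm{rank}(A)\,\delta_n^2(l,t) \big\}.$$

Let $t > 0$. Then, with probability at least $1 - e^{-t}$, for any $K$ satisfying (2) with $\sum_{k=1}^{r} \lambda_k < \infty$ we have

$$\|\hat K^{(l)} - K\|_2^2 \le \inf_{A \in \mathcal{S}_l,\ A \ge 0} \big\{ \|A - K\|_2^2 + C\,v_n(A,l,t) \big\} \qquad (6)$$

with some absolute constant $C > 0$.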
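The quantities entering Theorem 3 are simple to compute once $\lambda_{\max}$ and $\sigma^2$ are given. The sketch below evaluates $\delta_n(l,t)$, a regularization level of the form $c(\lambda_{\max} + \sigma^2)\delta_n(l,t)$, and the complexity term $v_n$. The function names are hypothetical, and the constant `c` is a user-chosen placeholder: the theorem only asks for a sufficiently large absolute constant and does not give its value.

```python
import math

def delta_n(l, t, n):
    """delta_n(l, t): max of sqrt((l + t)/n) and (l + t)/n."""
    ratio = (l + t) / n
    return max(math.sqrt(ratio), ratio)

def regularization_level(l, t, n, lam_max, sigma2, c=1.0):
    """mu = c * (lam_max + sigma2) * delta_n(l, t); c is a user-chosen constant."""
    return c * (lam_max + sigma2) * delta_n(l, t, n)

def v_n(trace_A, rank_A, l, t, n, lam_max, sigma2):
    """Complexity term v_n(A, l, t), computed from tr(A) and rank(A)."""
    d = delta_n(l, t, n)
    scale = lam_max + sigma2
    return min(scale * trace_A * d, scale**2 * rank_A * d**2)

print(delta_n(l=1, t=1, n=8))   # ratio 0.25, so the square-root branch applies
```

When $l + t \le n$ the square-root branch of $\delta_n$ is the larger one, so the rank term of $v_n$ scales like $\mathrm{rank}(A)(l+t)/n$.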
The bound (6) is the main oracle inequality that we will use now to obtain minimax bounds on the risk of the estimator $\hat K^{(l)}$. It is easy to check that

$$v_n(A,l,t) \le (\lambda_{\max} + \sigma^2)^2\,\mathrm{rank}(A)\,\frac{l+t}{n}.$$

The above bound is trivial if $l + t \le n$. In the case $l + t > n$, it follows from the bound

$$(\lambda_{\max} + \sigma^2)\,\mathrm{tr}(A)\,\frac{l+t}{n} \le (\lambda_{\max} + \sigma^2)\,\lambda_{\max}\,\mathrm{rank}(A)\,\frac{l+t}{n} \le (\lambda_{\max} + \sigma^2)^2\,\mathrm{rank}(A)\,\frac{l+t}{n},$$

which holds whenever $\|A\|_\infty \le \lambda_{\max}$, in particular for $A = K^{(l)}$.

Combining Theorem 3 with the fact that, for a random variable $\eta$, $\mathbb{E}[|\eta|] = \int_0^\infty P(|\eta| > t)\,dt$, and taking $A = K^{(l)}$, we obtain

$$\mathbb{E}[\|\hat K^{(l)} - K\|_2^2] \le \|K^{(l)} - K\|_2^2 + C(\lambda_{\max} + \sigma^2)^2\,\frac{(r \wedge l)\,l}{n} \qquad (7)$$

for some absolute constant $C > 0$, where we have used that $\mathrm{rank}(K^{(l)}) \le r \wedge l$. This inequality is valid for all $K$ of the form (2), with finite or infinite $r$.

As a corollary, we get the following bound on the minimax risk over the class of covariance functions that admit a finite expansion with respect to the basis $\{e_k\}$. Denote by $\mathcal{K}_{r,l}(\lambda_{\max})$ the class of all covariance functions satisfying (2) such that $K \in \mathcal{S}_l$ and $\|K\|_\infty \le \lambda_{\max}$, where $\lambda_{\max}$ is a finite positive constant. Note that the system of functions $\{\varphi_k\}$ in this definition is not fixed and varies among all orthonormal systems in $L_2[0,1]$.

Corollary 4. Under the assumptions of Theorem 3 we have

$$\sup_{K \in \mathcal{K}_{r,l}(\lambda_{\max})} \mathbb{E}[\|\hat K^{(l)} - K\|_2^2] \le C(\lambda_{\max} + \sigma^2)^2\,\frac{(r \wedge l)\,l}{n}$$

for some absolute constant $C > 0$.

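It is interesting to compare the estimator $\hat K^{(l)}$ with the other natural estimator, which is the corrected empirical covariance function

$$\tilde K^{(l)} = R_n^{(l)} - \sigma^2 I^{(l)}.$$

We have the following expression for the risk of $\tilde K^{(l)}$.

Proposition 5. For any $K$ satisfying (2) with $\sum_{k=1}^{r} \lambda_k < \infty$ we have

$$\mathbb{E}[\|\tilde K^{(l)} - K\|_2^2] = \|K^{(l)} - K\|_2^2 + \frac{\|B_l\|_F^2 + [\mathrm{tr}(B_l)]^2}{n},$$

where $B_l = K^{(l)} + \sigma^2 I_l$.

Proof. Set for brevity $B = B_l$, $B_n = B_{n,l}$, $x_i = x_i(l)$. Note that $\mathbb{E}(\tilde K^{(l)}) = K^{(l)}$. The bias-variance decomposition of the risk of $\tilde K^{(l)}$ yields

$$\mathbb{E}[\|\tilde K^{(l)} - K\|_2^2] = \|K^{(l)} - K\|_2^2 + \mathbb{E}[\|R_n^{(l)} - \mathbb{E}(R_n^{(l)})\|_2^2].$$

Here, $\mathbb{E}[\|R_n^{(l)} - \mathbb{E}(R_n^{(l)})\|_2^2] = \mathbb{E}[\|B_n - B\|_F^2] = \mathbb{E}\big[\big\|\frac{1}{n}\sum_{i=1}^{n} W_i\big\|_F^2\big]$, where $W_i = x_i x_i^\top - \mathbb{E}[x_i x_i^\top]$. Since the matrices $W_i$ are i.i.d. with zero mean, we find

$$\mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^{n} W_i\Big\|_F^2\Big] = \frac{1}{n^2}\,\mathbb{E}\,\mathrm{tr}\Big(\sum_{i,j=1}^{n} W_i^\top W_j\Big) = \frac{1}{n}\,\mathrm{tr}\big(\mathbb{E}(W_1^\top W_1)\big) = \frac{1}{n}\big(\mathbb{E}(|x_1|_2^4) - \mathrm{tr}(B^\top B)\big),$$

where $|\cdot|_2$ denotes the Euclidean norm. Here, $\mathbb{E}(|x_1|_2^4) - \mathrm{tr}(B^\top B) = \|B\|_F^2 + [\mathrm{tr}(B)]^2$, and the result follows. □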
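The variance identity in Proposition 5 can be checked numerically. The sketch below compares the closed-form term $(\|B\|_F^2 + [\mathrm{tr}(B)]^2)/n$ with a Monte Carlo estimate of $\mathbb{E}\|B_n - B\|_F^2$ for Gaussian vectors with covariance $B$. The function names, the diagonal matrix $B$, and the simulation sizes are illustrative assumptions.

```python
import numpy as np

def variance_term(B, n):
    """Exact variance term of Proposition 5: (||B||_F^2 + tr(B)^2) / n."""
    B = np.asarray(B, dtype=float)
    return (np.sum(B**2) + np.trace(B) ** 2) / n

def monte_carlo_variance(B, n, reps=4000, seed=0):
    """Monte Carlo estimate of E||B_n - B||_F^2 for x_i ~ N(0, B)."""
    rng = np.random.default_rng(seed)
    B = np.asarray(B, dtype=float)
    root = np.linalg.cholesky(B)
    total = 0.0
    for _ in range(reps):
        x = rng.standard_normal((n, B.shape[0])) @ root.T
        B_n = x.T @ x / n
        total += np.sum((B_n - B) ** 2)
    return total / reps

B = np.diag([3.0, 1.5, 1.0, 1.0])   # e.g. K^{(l)} + sigma^2 I with sigma^2 = 1
exact = variance_term(B, n=50)
approx = monte_carlo_variance(B, n=50)
print(f"exact {exact:.4f}, Monte Carlo {approx:.4f}")
```

The squared-trace part of the variance term is at least $\sigma^4 l^2$, which is the source of the $l^2/n$ behaviour of the corrected empirical covariance estimator.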

Since $\mathrm{tr}(B_l) \ge \sigma^2 l$, Proposition 5 implies

$$\mathbb{E}[\|\tilde K^{(l)} - K\|_2^2] \ge \|K^{(l)} - K\|_2^2 + \frac{\sigma^4 l^2}{n} \qquad (8)$$

and

$$\inf_{K} \mathbb{E}[\|\tilde K^{(l)} - K\|_2^2] \ge \frac{\sigma^4 l^2}{n}, \qquad (9)$$

where $\inf_K$ is the infimum over all $K$ satisfying (2) with $\sum_{k=1}^{r} \lambda_k < \infty$.

Comparing (9) with Corollary 4, we see that the risk of the empirical estimator $\tilde K^{(l)}$ on the class $\mathcal{K}_{r,l}(\lambda_{\max})$ is of greater order than the risk of our estimator $\hat K^{(l)}$ when $r$ is smaller than $l$.

Our estimator $\hat K^{(l)}$ also outperforms the estimator $\tilde K^{(l)}$ for kernels $K$ that do not admit a finite expansion with respect to the basis $\{e_k\}$ but satisfy some regularity conditions. To this end, we introduce a specific norm that can be naturally interpreted as a version of the Sobolev norm for covariance functions. Fix the smoothness parameter $s > 0$. For any symmetric function $K : [0,1]^2 \to \mathbb{R}$, we define



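$$\|K\|_{s,2} = \|A^s K\|_2 = \Big( \sum_{j,k \ge 1} j^{2s} \langle K e_j, e_k \rangle^2 \Big)^{1/2},$$

where $A$ is an operator admitting the matrix representation $\mathrm{diag}(1, 2, \dots, k, \dots)$ w.r.t. the basis $\{e_k\}_{k \ge 1}$. Note that the norm $\|K\|_{s,2}$ depends on the basis $\{e_k\}$, but we do not indicate this dependence in the notation since $\{e_k\}$ is fixed. Note also that if $K$ admits the spectral representation (2), then

$$\|K\|_{s,2} = \Big( \sum_{k=1}^{r} \lambda_k^2 \|\varphi_k\|_{s,2}^2 \Big)^{1/2},$$

where we use the notation

$$\|\varphi\|_{s,2} = \|A^s \varphi\|_2 = \Big( \sum_{k \ge 1} k^{2s} \langle \varphi, e_k \rangle^2 \Big)^{1/2}$$

for a Sobolev type norm of a function $\varphi : [0,1] \to \mathbb{R}$.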

Assumption 6. Suppose the covariance function $K$ has finite rank $r$ and there exist constants $\lambda_{\max} > 0$, $s > 0$ and $\rho > 1$ such that $\|K\|_\infty \le \lambda_{\max}$ and $\|K\|_{s,2} \le \rho$.

Denote by $\mathcal{K}_r(s, \rho; \lambda_{\max})$ the class of all kernels $K$ satisfying Assumption 6.
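Theorem 7. Given $r \ge 1$, $s > 0$, $\rho > 0$ and $\lambda_{\max} > 0$, set

$$\ell := \max\Big\{ \Big\lceil \Big( \frac{\rho^2 n}{(\lambda_{\max} + \sigma^2)^2\, r} \Big)^{1/(2s+1)} \Big\rceil,\ \Big\lceil \Big( \frac{\rho^2 n}{(\lambda_{\max} + \sigma^2)^2} \Big)^{1/(2s+2)} \Big\rceil \Big\}.$$

Then, with some absolute constant $C > 0$,

$$\sup_{K \in \mathcal{K}_r(s,\rho;\lambda_{\max})} \mathbb{E}[\|\hat K^{(\ell)} - K\|_2^2] \le C \min\Big\{ \rho^{2/(2s+1)} \Big( \frac{(\lambda_{\max} + \sigma^2)^2\, r}{n} \Big)^{2s/(2s+1)},\ \rho^{2/(s+1)} \Big( \frac{(\lambda_{\max} + \sigma^2)^2}{n} \Big)^{s/(s+1)} \Big\}. \qquad (10)$$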


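Proof. Since $K$ satisfies Assumption 6, we have for any $l \ge 1$ that

$$\|K - K^{(l)}\|_2^2 \le \sum_{k \ge l+1} \sum_{k' \ge 1} \langle K e_k, e_{k'} \rangle^2 + \sum_{k' \ge l+1} \sum_{k \ge 1} \langle K e_k, e_{k'} \rangle^2$$

$$\le (l+1)^{-2s} \sum_{k \ge l+1} \sum_{k' \ge 1} k^{2s} \langle K e_k, e_{k'} \rangle^2 + (l+1)^{-2s} \sum_{k' \ge l+1} \sum_{k \ge 1} (k')^{2s} \langle K e_k, e_{k'} \rangle^2 \le 2\rho^2 l^{-2s}.$$

Combining the previous display with (7), we find that, for any $l \ge 1$,

$$\mathbb{E}[\|\hat K^{(l)} - K\|_2^2] \le 2\rho^2 l^{-2s} + C(\lambda_{\max} + \sigma^2)^2\,\frac{(r \wedge l)\,l}{n}.$$

The minimum of the right-hand side of this inequality is achieved for $l$ of the order of $\ell$. By setting $l = \ell$, we obtain (10). □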
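The bias-variance trade-off behind the choice of the cutoff can be explored numerically. The sketch below computes a cutoff as the larger of the two balancing values (one for the regime where the rank is below the cutoff, one for the regime where it is above) and evaluates the bound $2\rho^2 l^{-2s} + C(\lambda_{\max} + \sigma^2)^2 (r \wedge l)\,l/n$. The function names, the rounding by ceilings, and the value `C=1.0` are assumptions for illustration; the paper's constant is unspecified.

```python
import math

def cutoff(n, r, rho, s, lam_max, sigma2):
    """Cutoff balancing squared bias and variance (ceilings are an assumption)."""
    scale = (lam_max + sigma2) ** 2
    l_low_rank = math.ceil((rho**2 * n / (scale * r)) ** (1.0 / (2 * s + 1)))
    l_full_rank = math.ceil((rho**2 * n / scale) ** (1.0 / (2 * s + 2)))
    return max(l_low_rank, l_full_rank)

def risk_bound(l, n, r, rho, s, lam_max, sigma2, C=1.0):
    """Right-hand side 2 rho^2 l^{-2s} + C (lam_max + sigma2)^2 min(r, l) l / n."""
    bias = 2.0 * rho**2 * l ** (-2.0 * s)
    variance = C * (lam_max + sigma2) ** 2 * min(r, l) * l / n
    return bias + variance

print(cutoff(n=2000, r=1, rho=1.0, s=1.0, lam_max=0.5, sigma2=0.5))
print(cutoff(n=2000, r=100, rho=1.0, s=1.0, lam_max=0.5, sigma2=0.5))
```

For a small rank the first balancing value dominates and the cutoff grows like $n^{1/(2s+1)}$; for a large rank the second one takes over and the cutoff grows like $n^{1/(2s+2)}$.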


Note that, if the rank $r$ is small, the problem of estimation of the covariance function $K$ reduces to estimation of a small number $r$ of eigenfunctions and eigenvalues of $K$. The rate in (10) is, in this case, of the order $n^{-2s/(2s+1)}$, which coincides with the standard minimax rate of estimation of a function of one variable of smoothness $s$. On the other hand, when the rank $r$ is large (say, $r = +\infty$), the estimation error rate becomes $n^{-2s/(2s+2)}$, which is the minimax rate of estimation of a function of two variables of smoothness $s$. Similar error rates were studied earlier in matrix completion problems for smooth kernels on graphs (see Koltchinskii and Rangel (2013)).


We consider now a class of kernels determined by the following assumption, which can be interpreted as a Sobolev type condition on the individual eigenfunctions $\varphi_j$.


Assumption 8. The value $r$ is finite and there exist constants $s > 0$, $c_* > 0$ such that, for any $1 \le j \le r$, $\|\varphi_j\|_{s,2} \le c_*$.

Denote by $\mathcal{K}_r(s, c_*; \lambda_{\max})$ the class of all kernels $K$ defined by (2) with eigenfunctions $\varphi_j$ satisfying Assumption 8 and such that $\|K\|_\infty \le \lambda_{\max}$.

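Theorem 9. Let $\ell = \max\big( \lceil n^{1/(2s+1)} \rceil,\ \lceil (rn)^{1/(2(s+1))} \rceil \big)$, $n \ge 1$, $1 \le r < \infty$. For any $s > 0$, $c_* > 0$, $\lambda_{\max} > 0$ we have

$$\sup_{K \in \mathcal{K}_r(s,c_*;\lambda_{\max})} \mathbb{E}[\|\hat K^{(\ell)} - K\|_2^2] \le C \min\big\{ r\,n^{-2s/(2s+1)},\ r^{1/(s+1)}\,n^{-s/(s+1)} \big\}, \qquad (11)$$

where $C > 0$ is a constant depending only on $\lambda_{\max}$, $\sigma$ and $c_*$.

Proof. It is enough to observe that, for all $K \in \mathcal{K}_r(s, c_*; \lambda_{\max})$,

$$\|K\|_{s,2}^2 = \sum_{k=1}^{r} \lambda_k^2 \|\varphi_k\|_{s,2}^2 \le c_*^2 \lambda_{\max}^2\, r,$$

implying that $\mathcal{K}_r(s, c_*; \lambda_{\max}) \subset \mathcal{K}_r(s, \rho; \lambda_{\max})$ with $\rho = c_* \lambda_{\max} \sqrt{r}$. Bound (11) now follows from (10). □

When $r$ is a fixed constant and $n$ is large, the rate in (11) is $O(n^{-2s/(2s+1)})$. The next theorem shows that this rate cannot be achieved by the corrected empirical covariance estimator, whatever the choice of $l$.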

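Theorem 10. Let $n \ge 1$, $1 \le r < \infty$. There exists $c_* > 0$ such that for any $s > 0$, $\lambda_{\max} > 0$ we have

$$\inf_{l \ge 1}\ \sup_{K \in \mathcal{K}_r(s,c_*;\lambda_{\max})} \mathbb{E}[\|\tilde K^{(l)} - K\|_2^2] \ge C n^{-s/(s+1)}, \qquad (12)$$

where $C > 0$ is a constant that can depend only on $\lambda_{\max}$, $s$ and $c_*$.


Proof. Fix $l \ge 1$ and consider the function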




xfc = l 


21 

E 


ek{t) ^ 

J.s+l/2 j ’ 


tG [0,1], 


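where $c_1$ is a normalizing constant, depending only on $s$, such that $\|\varphi_1\|_2 = 1$. By an easy computation, $\|\varphi_1\|_{s,2} \le c'$ for a constant $c'$ depending only on $s$.

Set $K(t,u) = \lambda_{\max} \varphi_1(t) \varphi_1(u)$. Then $K \in \mathcal{K}_r(s, c_*; \lambda_{\max})$ with $c_* = c'$. Due to (8),

$$\sup_{K \in \mathcal{K}_r(s,c_*;\lambda_{\max})} \mathbb{E}[\|\tilde K^{(l)} - K\|_2^2] \ge \sup_{K \in \mathcal{K}_r(s,c_*;\lambda_{\max})} \Big\{ \|K^{(l)} - K\|_2^2 + \frac{\sigma^4 l^2}{n} \Big\} \ge \|K^{(l)} - K\|_2^2 + \frac{\sigma^4 l^2}{n}, \qquad (13)$$

where, in the last expression, $K$ is the specific kernel defined above. Observe that

$$\varphi_1 \otimes \varphi_1 = \varphi_1^{(l)} \otimes \varphi_1^{(l)} + (\varphi_1 - \varphi_1^{(l)}) \otimes \varphi_1^{(l)} + \varphi_1 \otimes (\varphi_1 - \varphi_1^{(l)}).$$

The last two terms are orthogonal to each other, because $\varphi_1 - \varphi_1^{(l)}$ is orthogonal to $\varphi_1^{(l)}$. Therefore,

$$\|\varphi_1 \otimes \varphi_1 - \varphi_1^{(l)} \otimes \varphi_1^{(l)}\|_2^2 = \|(\varphi_1 - \varphi_1^{(l)}) \otimes \varphi_1^{(l)}\|_2^2 + \|\varphi_1 \otimes (\varphi_1 - \varphi_1^{(l)})\|_2^2 \ge \|\varphi_1\|_2^2\, \|\varphi_1 - \varphi_1^{(l)}\|_2^2 = \|\varphi_1 - \varphi_1^{(l)}\|_2^2.$$

This implies that

$$\|K^{(l)} - K\|_2^2 \ge \lambda_{\max}^2 \|\varphi_1 - \varphi_1^{(l)}\|_2^2 \ge c\,\lambda_{\max}^2\, l^{-2s}$$

for some constant $c > 0$ depending only on $s$.

Using this inequality in (13) and taking the minimum over $l \ge 1$, we obtain the result. □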
4. Adaptive Estimation

We observe that the optimal choice of the parameter $l$ in Theorems 7 and 9 depends on the unknown parameters $\rho$, $s$ and $r$, which quantify respectively the smoothness of the eigenfunctions of $K$ and their number. In this section, we propose an adaptive estimator that does not depend on $s$ and $r$ and attains the same rate as in Theorem 7 or in Theorem 9.

First, we describe a general method of aggregating estimators. Assume without loss of generality that the sample size $n$ is even. We split the sample of $n$ trajectories $\mathcal{X} = \{X_1, \dots, X_n\}$ into two parts of equal size $n/2$, denoted $\mathcal{X}_1 = \{X_1, \dots, X_{n/2}\}$ and $\mathcal{X}_2 = \{X_{n/2+1}, \dots, X_n\}$. Fix an integer $L$. Using the sample $\mathcal{X}_1$, we construct a family of estimators $\bar K^{(1)}, \dots, \bar K^{(L)}$ such that $\bar K^{(l)} \in \mathcal{S}_l$, $1 \le l \le L$. These can be, for example, the estimators $\hat K^{(1)}, \dots, \hat K^{(L)}$ defined in (4).

Consider the following adaptive selector of $l$:

$$\hat l = \operatorname{argmin}_{1 \le l \le L} \big\{ \|\bar K^{(l)}\|_2^2 - 2\langle \bar K^{(l)}, \tilde R_n^{(l)} - \sigma^2 I^{(l)} \rangle \big\}, \qquad (14)$$

where $\tilde R_n^{(l)}(t,u) = \frac{2}{n} \sum_{i=n/2+1}^{n} X_i^{(l)}(t) X_i^{(l)}(u)$ is the projected empirical covariance function associated to the second subsample $\mathcal{X}_2$.
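The selection rule can be sketched in matrix form as follows. Since the kernels involved are expansions in the orthonormal system of products of basis functions, inner products of kernels become Frobenius inner products of their coefficient matrices, and the projected empirical covariance for cutoff $l$ is the top-left $l \times l$ block of the full empirical covariance matrix of the second subsample. The function name and the toy inputs below are hypothetical.

```python
import numpy as np

def select_cutoff(candidates, x2, sigma2):
    """Data-driven choice of l by minimizing ||A||^2 - 2 <A, R - sigma2 * I>.

    candidates[l - 1] is an l x l symmetric matrix (estimator built from the
    first subsample); x2 holds the basis coefficients of the second
    subsample, one trajectory per row, with at least L columns.
    Returns the selected l (1-based).
    """
    x2 = np.asarray(x2, dtype=float)
    R = x2.T @ x2 / x2.shape[0]              # empirical covariance, 2nd half
    scores = []
    for l, A in enumerate(candidates, start=1):
        A = np.asarray(A, dtype=float)
        target = R[:l, :l] - sigma2 * np.eye(l)
        scores.append(np.sum(A**2) - 2.0 * np.sum(A * target))
    return int(np.argmin(scores)) + 1

x2 = np.diag([3.0, np.sqrt(6.0), np.sqrt(3.0)])   # gives R = diag(3, 2, 1)
candidates = [np.diag([2.0]), np.diag([2.0, 1.0]), np.diag([2.0, 1.0, 0.5])]
print(select_cutoff(candidates, x2, sigma2=1.0))
```

The criterion is an unbiased estimate of the squared distance to $K$ up to a term that does not depend on $l$, which is why it can be used to rank the candidates without knowing $K$.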

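In the following theorem we assume that the first subsample is frozen, so we state the result for non-random functions $\bar K^{(l)} \in \mathcal{S}_l$, $1 \le l \le L$.

Theorem 11. Let $\bar K^{(l)}$, $1 \le l \le L$, be functions such that $\bar K^{(l)} \in \mathcal{S}_l$. For any $t > 0$, with probability at least $1 - e^{-t}$ with respect to the subsample $\mathcal{X}_2$, we have

$$\|\bar K^{(\hat l)} - K\|_2^2 \le 2 \min_{1 \le l \le L} \|\bar K^{(l)} - K\|_2^2 + C(\lambda_{\max} \vee \sigma^2)^2 \max\Big\{ \frac{t + \log L}{n},\ \Big( \frac{t + \log L}{n} \Big)^2 \Big\}$$

for all $K$ satisfying (2) with $\sum_{k=1}^{r} \lambda_k < \infty$. Here, $C > 0$ is an absolute constant.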

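Proof. Fix an arbitrary $l \in \{1, \dots, L\}$. Note that, by definition, $(\mathcal{S}_l)_{l \ge 1}$ is a nested sequence satisfying

$$\mathcal{S}_{l+1} = \mathcal{S}_l \oplus \mathrm{l.s.}\{ e_j \otimes e_{l+1} + e_{l+1} \otimes e_j,\ 1 \le j \le l+1 \}.$$

Consequently, for any $1 \le l, l' \le L$, we have $\langle \bar K^{(l)}, \tilde R_n^{(l)} \rangle = \langle \bar K^{(l)}, \tilde R_n^{(l \vee l')} \rangle$. Similarly, $\langle \bar K^{(l)}, I^{(l)} \rangle = \langle \bar K^{(l)}, I^{(l \vee l')} \rangle$ and $\langle \bar K^{(l)}, K \rangle = \langle \bar K^{(l)}, K^{(l \vee l')} \rangle$. Combining this observation with (14), we get

$$\|\bar K^{(\hat l)} - K\|_2^2 - \|\bar K^{(l)} - K\|_2^2$$

$$= \|\bar K^{(\hat l)}\|_2^2 - 2\langle \bar K^{(\hat l)}, K \rangle - \big[ \|\bar K^{(l)}\|_2^2 - 2\langle \bar K^{(l)}, K \rangle \big]$$

$$= \|\bar K^{(\hat l)}\|_2^2 - 2\langle \bar K^{(\hat l)}, \tilde R_n^{(\hat l)} - \sigma^2 I^{(\hat l)} \rangle - \big[ \|\bar K^{(l)}\|_2^2 - 2\langle \bar K^{(l)}, \tilde R_n^{(l)} - \sigma^2 I^{(l)} \rangle \big] + 2\langle \bar K^{(\hat l)} - \bar K^{(l)},\ \tilde R_n^{(\hat l \vee l)} - K^{(\hat l \vee l)} - \sigma^2 I^{(\hat l \vee l)} \rangle$$

$$\le 2\langle \bar K^{(\hat l)} - \bar K^{(l)},\ \tilde R_n^{(\hat l \vee l)} - K^{(\hat l \vee l)} - \sigma^2 I^{(\hat l \vee l)} \rangle.$$

Here, for every fixed $m$, $K^{(m)} + \sigma^2 I^{(m)} = \mathbb{E}[\tilde R_n^{(m)}]$. Setting for brevity $m = l \vee l'$, we deduce from the previous display that

$$\|\bar K^{(\hat l)} - K\|_2^2 - \|\bar K^{(l)} - K\|_2^2 \le 2U \|\bar K^{(\hat l)} - \bar K^{(l)}\|_2 \le \frac{1}{6} \|\bar K^{(\hat l)} - \bar K^{(l)}\|_2^2 + 6U^2,$$

where $U = \max_{l'=1,\dots,L} |\zeta_{l'}|$ with $\zeta_{l'} = \langle U_{l'}, \tilde R_n^{(m)} - \mathbb{E}[\tilde R_n^{(m)}] \rangle$, and $U_{l'} = (\bar K^{(l')} - \bar K^{(l)}) / \|\bar K^{(l')} - \bar K^{(l)}\|_2$ if $\bar K^{(l')} \ne \bar K^{(l)}$ and $U_{l'} = 0$ otherwise. It follows from the last display and the bound

$$\frac{1}{6} \|\bar K^{(\hat l)} - \bar K^{(l)}\|_2^2 \le \frac{1}{3} \|\bar K^{(\hat l)} - K\|_2^2 + \frac{1}{3} \|\bar K^{(l)} - K\|_2^2$$

that

$$\|\bar K^{(\hat l)} - K\|_2^2 \le 2 \|\bar K^{(l)} - K\|_2^2 + 9U^2. \qquad (15)$$

Since $l$ is arbitrary, to complete the proof it suffices to bound the random variable $U$ in probability. We first obtain a bound for each of the variables $\zeta_{l'} = \langle U_{l'}, \tilde R_n^{(m)} - \mathbb{E}[\tilde R_n^{(m)}] \rangle$. Note that, associating $U_{l'}$ with the corresponding $m \times m$ matrix, which we will also denote by $U_{l'}$, we can write $\zeta_{l'} = \langle U_{l'}, \hat B - B \rangle$, where

$$\hat B = \frac{2}{n} \sum_{i=n/2+1}^{n} x_i(m) x_i(m)^\top, \qquad B = \mathbb{E}[x_1(m) x_1(m)^\top],$$

and $x_i(m)$ are i.i.d. normal vectors with mean 0 and covariance matrix $B$ (cf. the proof of Lemma 2), and $\langle \cdot, \cdot \rangle$ is the inner product of matrices. It follows that

$$\zeta_{l'} = \frac{2}{n} \sum_{i=n/2+1}^{n} \big( \langle U_{l'}, x_i(m) x_i(m)^\top \rangle - \langle U_{l'}, B \rangle \big) = \mathrm{tr}\Big( \frac{2}{n} \sum_{i=n/2+1}^{n} B^{1/2} U_{l'} B^{1/2} (Z_i Z_i^\top - I_m) \Big) = \frac{2}{n} \sum_{i=n/2+1}^{n} \big( Z_i^\top D Z_i - \mathrm{tr}(D) \big),$$

where $Z_1, \dots, Z_n$ are i.i.d. standard normal vectors in $\mathbb{R}^m$ and $D = B^{1/2} U_{l'} B^{1/2}$. By the Hanson-Wright inequality (see, e.g., Rudelson and Vershynin (2013)) we have that for any $t > 0$, with probability at least $1 - e^{-t}$,

$$\Big| \frac{2}{n} \sum_{i=n/2+1}^{n} \big( Z_i^\top D Z_i - \mathrm{tr}(D) \big) \Big| \le C \max\Big\{ \|D\|_F \sqrt{\frac{t}{n}},\ \|D\|_\infty \frac{t}{n} \Big\}, \qquad (16)$$

where $C > 0$ is an absolute constant. Since $\|U_{l'}\|_2 \le 1$ when considering $U_{l'}$ as a function (which is equivalent to $\|U_{l'}\|_F \le 1$ when considering $U_{l'}$ as a matrix) and $\|B\|_\infty \le \lambda_{\max} + \sigma^2$, we have $\|D\|_\infty \le \|D\|_F \le \lambda_{\max} + \sigma^2$. Thus, with probability at least $1 - e^{-t}$,

$$|\zeta_{l'}| \le C'(\lambda_{\max} \vee \sigma^2) \max\Big\{ \sqrt{\frac{t}{n}},\ \frac{t}{n} \Big\},$$

where $C' > 0$ is an absolute constant. The union bound argument gives that, with probability at least $1 - e^{-t}$,

$$U^2 = \max_{l'=1,\dots,L} \zeta_{l'}^2 \le C_*(\lambda_{\max} \vee \sigma^2)^2 \max\Big\{ \sqrt{\frac{t + \log L}{n}},\ \frac{t + \log L}{n} \Big\}^2,$$

where $C_* > 0$ is an absolute constant. Combining this with (15) proves the theorem. □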

We now apply Theorem 11 to $\bar K^{(l)} = \hat K^{(l)}$, where the estimators $\hat K^{(1)}, \dots, \hat K^{(L)}$ are defined in (4). Combining Theorems 3, 11 and the fact that, for a random variable $\eta$, $\mathbb{E}[|\eta|] = \int_0^\infty P(|\eta| > t)\,dt$, we get the following result.

Theorem 12. Let each of the estimators $\hat K^{(l)}$ satisfy the conditions of Theorem 3. Then

E 


- K\\l 


< C min inf 

1<1<L A£Si,A>0 


{||/1 - /f 11^ + v„(A, I ,;) 


for all $K$ satisfying (2) with $\sum_{k=1}^{r} \lambda_k < \infty$. Here, $C > 0$ is an absolute constant.


We now fix $L = n$. Using Theorem 12, Theorem 3, Theorem 9 and Corollary 4, we obtain the following result.

Theorem 13. Let each of the estimators $\bar K^{(l)} = \hat K^{(l)}$ satisfy the conditions of Theorem 3. Let $\hat K^{(\hat l)}$ be the aggregated estimator with $\hat l$ defined in (14) with $L = n$.

(i) For any r > 1, c* > 0 and s > 0 such that 1 < r < , we have 

sup — K\\\ < C min {rn ~, 

K. G/Cr (SjC* ; Amax) 


where C > Q is a constant that can depend only on Amax, s. 

(a) For any r > 1, p > 1, s > 0, Amax > 0, > 0 such that < 

(Amax + cr^)^min we have 

sup E||i(')< Cmin 

K G/Cr (-5,p; Amax) 

where C > Q is a constant that can depend only on AmaxW^,P; o,nd s. 

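(iii) If $(r \wedge l)\,l \ge \log n$ and $l \le n$, then for any $\lambda_{\max} > 0$,

$$\sup_{K \in \mathcal{K}_{r,l}(\lambda_{\max})} \mathbb{E}[\|\hat K^{(\hat l)} - K\|_2^2] \le C\,\frac{(r \wedge l)\,l}{n},$$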


f \ 2s/ (2s+l) 

nJ 


n 


-s/(s+l) 


where $C > 0$ is a constant that can depend only on $\lambda_{\max}$ and $\sigma^2$.

The conditions r < and p^ < (Amax + niin are 

rather mild. Indeed, if $r$ and $\rho$ are fixed quantities, then these conditions are satisfied for $n$ large enough. Theorem 13 shows that the estimator $\hat K^{(\hat l)}$ is adaptive to the unknown parameters $r$ and $s$ on the scale of classes $\mathcal{K}_r(s, \rho; \lambda_{\max})$ and $\mathcal{K}_r(s, c_*; \lambda_{\max})$, in the sense that no price is paid in the rate as compared to the non-adaptive estimators of Theorems 7 and 9. The same estimator is adaptive on the scale of classes $\mathcal{K}_{r,l}(\lambda_{\max})$, again with no price to be paid, for a wide range of values of $l$ and $r$.

5. Estimation of $\sigma^2$


We now tackle the estimation of the unknown variance $\sigma^2$. We use the simple idea that $\langle e_l, S \rangle$ becomes negligible for large $l$ when Assumption 8 is satisfied. Therefore, we propose the following (biased) estimator of $\sigma^2$ based on an independent copy $\tilde X$ of the process (1):


d2 = 


M 


XiL+M) _ j^{L) 


L = e^,M>l. 


(17) 
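The idea behind the variance estimator can be illustrated numerically. The sketch below assumes that the estimator averages the squared basis coefficients of one trajectory over the indices $L+1, \dots, L+M$, which by Parseval's identity is the same as $\frac{1}{M}\|\tilde X^{(L+M)} - \tilde X^{(L)}\|_2^2$. This reading, the function name, and the toy signal with coefficients decaying like $k^{-2}$ are assumptions for illustration; in particular the paper's specific choice of $L$ is not reproduced here.

```python
import numpy as np

def estimate_noise_variance(coeffs, L, M):
    """Average of squared coefficients with indices L+1, ..., L+M.

    coeffs[k - 1] is the coefficient of e_k for one trajectory. By
    Parseval's identity this average equals (1/M) ||X^(L+M) - X^(L)||_2^2.
    """
    tail = np.asarray(coeffs, dtype=float)[L:L + M]
    if tail.size != M:
        raise ValueError("need at least L + M coefficients")
    return float(np.mean(tail**2))

rng = np.random.default_rng(0)
L, M, sigma = 50, 20000, 1.5
k = np.arange(1, L + M + 1, dtype=float)
signal = rng.standard_normal() * k ** -2.0     # smooth signal: fast-decaying coefficients
coeffs = signal + sigma * rng.standard_normal(L + M)
sigma2_hat = estimate_noise_variance(coeffs, L, M)
print(sigma2_hat)
```

The signal contributes only through its high-index coefficients, so the bias shrinks as $L$ grows, while the fluctuation of the noise part shrinks as $M$ grows.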


Theorem 14. Let n,l > 1 be integers and let be i.i.d. re¬ 

alizations of the process X(-) satisfying (If). Let Assumption\E be satisfied. 
For any t > 0, we have with probability at least 1 — 


^9 2 I ^ 

a — (T >, max 


^*(1 V \/t Vt), 


77^ f± ±\ 

V M’ MJ ■ 


Proof. By the Plancherel identity, we have

|2 




M 


1 

M 


lyiL+M) _ py(L) 
L+M 


a 


-| L-fKI ^ L-\-M 

E +17 E +17 E 


M 


l=L 


l=L 


l=L 


(18) 


where $z_L, \dots, z_{L+M}$ are i.i.d. standard normal random variables, also independent of $S$.

We now take the expectation 


E 


L+M r 

i7EE^i(i’i'e‘>"- 

l=L j=l 


Note that {(pj,ei)‘^ < \\Fj — view of Assumption [HI we get 

.. L+M r 2 L+M-1 

Jf ~ clrX^^^L-^L 

l=L j=l l=L-l 
The bound in probability follows easily from the representation (18). Indeed, the second term can be treated using standard deviation bounds for Gaussian random variables combined with a conditioning argument. The third term can be treated with a standard deviation inequality for chi-square distributions. The first term can be treated using (16) again. More specifically, set $\xi = (\xi_1, \dots, \xi_r)^\top$ and $A = (a_{j,j'})_{1 \le j,j' \le r}$ with


Then, we have 




M 


L+M 

l=L 


1 

M 


L+M 


l=L 


-E 


1 

M 


L+M 


l=L 




with < clrX^^L and ||A||oo < cly/rX^^^L 

A union bound argument gives the result. Details of the proof are omitted here.

□ 


6. Minimax lower bound 

In this section, we show that the upper bounds of Corollary 4 and Theorem 13 cannot be improved in a minimax sense.

Theorem 15. Let $1 \le r < \infty$ and let $\lambda_{\max} > 0$ be a given constant. Then there exist absolute constants $C_0 > 0$ and $0 < C_1 < 1$ such that, for any integers $n$ and $l$ satisfying $l \ge 2$, $n \ge l$, we have

$$\inf_{\hat K_n}\ \sup_{K \in \mathcal{K}_{r,l}(\lambda_{\max})} \mathbb{P}_K\Big( \|\hat K_n - K\|_2^2 \ge C_0 (\lambda_{\max} \wedge \sigma^2)^2\, \frac{(r \wedge l)\,l}{n} \Big) \ge C_1,$$

where $\inf_{\hat K_n}$ denotes the infimum over all estimators of $K$.

Proof. Let first $r \le l/2$. Consider the vector-functions $e(t) = (e_1(t), \dots, e_l(t))^\top$ and $\varphi(t) = (\varphi_1(t), \dots, \varphi_r(t))^\top$ and a subset of $\mathcal{K}_{r,l}(\lambda_{\max})$ composed of kernels $K$ satisfying (2) with $\lambda_j = \gamma$ and

$$\varphi(t) = H e(t)$$

for suitable $\gamma > 0$ and suitable $r \times l$ matrices $H$. Orthonormality of the functions $\varphi_j$ implies that $H$ must satisfy $H H^\top = I_r$, where $I_r$ is the $r \times r$ identity matrix, i.e., the rows of $H$ should be orthonormal. To each such matrix $H$ we associate a linear subspace $U_H$ of $\mathbb{R}^l$, which is the linear span of the $r$ rows of $H$. Clearly, $\dim(U_H) = r$ and $H^\top H$ is the orthogonal projector onto $U_H$ in $\mathbb{R}^l$.

Note that the set of all such spaces $U_H$ is the Grassmannian manifold $G_r(\mathbb{R}^l)$, i.e., the set of $r$-dimensional linear subspaces of $\mathbb{R}^l$. The Grassmannian manifold $G_r(\mathbb{R}^l)$ is a smooth manifold of dimension $d = r(l - r)$. A natural metric $d(\cdot, \cdot)$ on $G_r(\mathbb{R}^l)$ is defined as follows: for $U, \bar U \in G_r(\mathbb{R}^l)$,

$$d(U, \bar U) = \|P_U - P_{\bar U}\|_F = \|H^\top H - \bar H^\top \bar H\|_F,$$

where $P_U$ is the orthogonal projector onto $U$ and $H$, $\bar H$ are the $r \times l$ matrices with orthonormal rows associated to $U$ and $\bar U$ respectively. We refer to Mattila (1995) and Milnor and Stasheff (1974) for more details on the Grassmannian manifold.

From now on, we will identify $U \in G_r(\mathbb{R}^l)$ with the associated orthogonal projector $P_U = H^\top H$. The behavior of entropy numbers of the Grassmannian manifold is well studied (Szarek (1982), see also Proposition 8 in Pajor (1998)). In particular, for any $\varepsilon \in (0,1)$ there exists a family of orthogonal projectors $\mathcal{U} \subset G_r(\mathbb{R}^l)$ such that


M> 



and $c\,\varepsilon\sqrt{r} \le \|P - Q\|_F \le \varepsilon\sqrt{r}/c$, $\quad \forall P, Q \in \mathcal{U},\ P \ne Q,$


(19) 


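for some small enough universal constant $c > 0$. Here $|\mathcal{U}|$ denotes the cardinality of $\mathcal{U}$. We take in what follows $\varepsilon = 1/2$. Set $N = |\mathcal{U}|$ and $\mathcal{U} = \{P_1, \dots, P_N\}$. The associated $H$-matrices will be denoted by $H_1, \dots, H_N$. Let $K_j$ be a kernel of the form (2) with eigenvalues $\lambda_i = \gamma$, $i = 1, \dots, r$, and eigenfunctions given by

$$\varphi(t) = H_j e(t),$$

where $\gamma = a(\sigma^2 \wedge \lambda_{\max}) \sqrt{l/n}$ and $a \in (0,1)$ is an absolute constant to be chosen later. Consider the set $\mathcal{K}' = \{K_1, \dots, K_N\}$. Clearly, we have $\mathcal{K}' \subset \mathcal{K}_{r,l}(\lambda_{\max})$.

We now evaluate the Kullback-Leibler divergence between two probability measures induced by the observations $\{X_1(t), \dots, X_n(t),\ t \in [0,1]\}$ corresponding to the kernels $K_1$ and $K_j$ (with $j \ne 1$). Using the Girsanov formula and the fact that $K_j$ is bilinear in $\{e_k\}$, it is easy to check that this divergence is equal to the Kullback-Leibler divergence between the $n$-product distributions of the associated Gaussian vectors $\big( \int_0^1 e_1(t)\,dX(t), \dots, \int_0^1 e_l(t)\,dX(t) \big)^\top$.

If $K = K_j$, this vector is distributed as $\mathcal{N}(0, \Sigma_j)$ with $\Sigma_j = \sigma^2 I_l + \gamma P_j = (\sigma^2 + \gamma) P_j + \sigma^2 P_j^\perp$ and $P_j^\perp = I_l - P_j$. Denote the corresponding Gaussian measure by $\mathbb{P}_j$ and by $\mathbb{P}_j^{\otimes n}$ its $n$-product. Let $\mathrm{KL}(P, Q)$ be the Kullback-Leibler divergence between two probability measures $P$ and $Q$.

It is easy to see that all matrices $\Sigma_j$ have the same eigenvalues. Thus, for any $2 \le j \le N$ we have

$$\mathrm{KL}(\mathbb{P}_1^{\otimes n}, \mathbb{P}_j^{\otimes n}) = n\,\mathrm{KL}(\mathbb{P}_1, \mathbb{P}_j) = \frac{n}{2} \big[ \mathrm{tr}(\Sigma_1^{-1} \Sigma_j) - l - \log\det(\Sigma_1^{-1} \Sigma_j) \big] = \frac{n}{2}\,\mathrm{tr}\big( \Sigma_1^{-1} (\Sigma_j - \Sigma_1) \big).$$

Now, $\Sigma_1^{-1} = (\sigma^2 + \gamma)^{-1} P_1 + \sigma^{-2} P_1^\perp$, which yields

$$\mathrm{tr}\big( \Sigma_1^{-1} (\Sigma_j - \Sigma_1) \big) = \frac{\gamma}{\sigma^2 + \gamma}\,\mathrm{tr}\big( P_1 (P_j - P_1) \big) + \frac{\gamma}{\sigma^2}\,\mathrm{tr}\big( P_1^\perp (P_j - P_1) \big)$$

$$= \Big( \frac{\gamma}{\sigma^2} - \frac{\gamma}{\sigma^2 + \gamma} \Big) \big( r - \mathrm{tr}(P_1 P_j) \big) = \frac{\gamma^2}{2(\sigma^2 + \gamma)\sigma^2}\, \|P_1 - P_j\|_F^2 \le \frac{r \gamma^2}{8 c^2 (\sigma^2 + \gamma)\sigma^2},$$

where for the last inequality we have used (19) with $\varepsilon = 1/2$, and the fact that $\mathrm{tr}(P_1 P_j) = r - \|P_1 - P_j\|_F^2 / 2$. Combining the last two displays, we find

$$\mathrm{KL}(\mathbb{P}_1^{\otimes n}, \mathbb{P}_j^{\otimes n}) \le a^2\, \frac{(\lambda_{\max} \wedge \sigma^2)^2}{8 c^2 (\sigma^2 + \gamma)\sigma^2}\; r l, \quad \forall\, 2 \le j \le N.$$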
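The two algebraic facts used in this computation can be verified numerically: the identity $\mathrm{tr}(P_1 P_j) = r - \|P_1 - P_j\|_F^2/2$ for orthogonal projectors of rank $r$, and the closed form of the per-observation Kullback-Leibler divergence, $\gamma^2 \|P_1 - P_j\|_F^2 / (4(\sigma^2 + \gamma)\sigma^2)$. The sketch below uses randomly drawn subspaces; the helper names and the numerical values of $l$, $r$, $\sigma^2$ and $\gamma$ are illustrative assumptions. The function `gaussian_kl` uses the standard convention for the divergence between centered Gaussians, which gives the same value as the expression in the text because the closed form is symmetric in the two projectors.

```python
import numpy as np

def projector(H):
    """Orthogonal projector H^T H for an r x l matrix H with orthonormal rows."""
    H = np.asarray(H, dtype=float)
    return H.T @ H

def gaussian_kl(S0, S1):
    """KL( N(0, S0) || N(0, S1) ) for positive definite covariance matrices."""
    d = S0.shape[0]
    M = np.linalg.solve(S1, S0)
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(M) - d - logdet)

rng = np.random.default_rng(0)
l, r, sigma2, gamma = 6, 2, 1.0, 0.3
# Two random r-dimensional subspaces of R^l, via orthonormalized Gaussian columns.
H1 = np.linalg.qr(rng.standard_normal((l, r)))[0].T
Hj = np.linalg.qr(rng.standard_normal((l, r)))[0].T
P1, Pj = projector(H1), projector(Hj)
dist2 = np.sum((P1 - Pj) ** 2)                       # ||P_1 - P_j||_F^2
Sigma1 = sigma2 * np.eye(l) + gamma * P1
Sigmaj = sigma2 * np.eye(l) + gamma * Pj
closed_form = gamma**2 * dist2 / (4.0 * (sigma2 + gamma) * sigma2)
print(gaussian_kl(Sigma1, Sigmaj), closed_form)
```

Because all the covariance matrices share the same eigenvalues, the log-determinant term vanishes, which is what makes the divergence a simple multiple of the squared projector distance.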


Recall that we assume $r \le l/2$, so that the dimension of the Grassmannian satisfies $d = r(l - r) \ge rl/2$. Consequently, in view of (19), we have that $\log |\mathcal{U}| \ge c\,r l$ for some absolute constant $c > 0$. Thus, we get

KL(Pr,Pf)<^log|Lf|, V2<j<iV, 

provided $a > 0$ is taken sufficiently small independently of $r$, $l$, $n$, $\sigma$, $\lambda_{\max}$. Next, for any $1 \le i, j \le N$ with $i \ne j$,

$$\|K_i - K_j\|_2^2 = \gamma^2 \|P_i - P_j\|_F^2 \ge c\,a^2 (\sigma^4 \wedge \lambda_{\max}^2)\, \frac{r l}{n},$$

where $c > 0$ is an absolute constant and the last inequality is due to (19). The result now follows from the last two displays by application of Theorem 2.5 in Tsybakov (2009).

Finally, consider the case $r > l/2$. Note that the classes $\mathcal{K}_{r,l}(\lambda_{\max})$ are nested in $r$. Assuming w.l.o.g. that $l$ is even, we get that the minimax risk over $\mathcal{K}_{r,l}(\lambda_{\max})$ is bounded from below by the minimax risk on $\mathcal{K}_{l/2,l}(\lambda_{\max})$. But the minimax risk on $\mathcal{K}_{l/2,l}(\lambda_{\max})$ has already been treated above, and we have proved that the lower rate is of the order $l^2/n$, which is the desired rate when $r > l/2$. □


Remark 16. It is possible to prove a minimax lower bound ensuring that the bound in Theorem 7 is optimal at least regarding the $n$ dependence. Indeed, by an argument similar to that used in the proof of Theorem 15, we can prove the existence of an absolute constant $0 < C_2 < 1$ and a constant $C_3 > 0$ possibly depending on $\lambda_{\max}$, $\rho$, $\sigma^2$, such that, for any integer $n \ge 1$ we have

inf sup P — iF ||2 > C 3 min r? 7 ,“ 2 STT, > ^2 

K€TCr{s,p-,\n.^.^) ^ ^ 


where $\inf_{\hat K}$ denotes the infimum over all estimators of $K$. Specifying the dependence of the minimax rate on the parameters $\sigma^2$, $\lambda_{\max}$, $\rho$ remains an interesting open question.